Identification of stress-related genes by co-expression network analysis based on the improved turbot genome

Turbot (Scophthalmus maximus), commercially important flatfish species, is widely cultivated in Europe and China. With the continuous expansion of the intensive breeding scale, turbot is exposed to various stresses, which greatly impedes the healthy development of turbot industry. Here, we present an improved high-quality chromosome-scale genome assembly of turbot using a combination of PacBio long-read and Illumina short-read sequencing technologies. The genome assembly spans 538.22 Mb comprising 27 contigs with a contig N50 size of 25.76 Mb. Annotation of the genome assembly identified 104.45 Mb repetitive sequences, 22,442 protein-coding genes and 3,345 ncRNAs. Moreover, a total of 345 stress responsive candidate genes were identified by gene co-expression network analysis based on 14 published stress-related RNA-seq datasets consisting of 165 samples. Significantly improved genome assembly and stress-related candidate gene pool will provide valuable resources for further research on turbot functional genome and stress response mechanism, as well as theoretical support for the development of molecular breeding technology for resistant turbot varieties.


Background & Summary
Scophthalmus maximus (FishBase ID: 1348), as known as turbot, an economically important flatfish (Pleuronectiformes), is native to Northeast Atlantic throughout the Mediterranean and along the European coasts to Arctic Circle 1 , and now is the most widely cultivated commercial flatfish around the world with the highest annual aquaculture production 1,2 . Since its firstly introduction into China in 1992, turbot aquaculture industry has made great progress, leading to the rise of the fourth wave of mariculture industry in China 2 . However, turbot was affected by various biotic and abiotic stresses during the breeding process, which seriously threatened the healthy development of turbot aquaculture industry and caused huge economic losses. Therefore, carrying out research on the resistance of turbot and obtaining genetic resources related to stress resistance will contribute to the research on the resistance molecular mechanism of turbot and provide theoretical support for the subsequent genetic improvement of turbot germplasm.
In recent years, numerous RNA-seq studies have been conducted to explore the stress responsive genes and molecular mechanisms under various stresses, such as pathogens stress (Enteromyxum scophthalmi 3,4 , Vibrio anguillarum 5 ), heat stress 6 , oxygen stress 7 , crowding stress 8 , salinity stress 9 , and feeding stress 10 . All these researches were solely focused on the identification of differentially expressed genes (DEGs), whereas connectivity analysis has not yet been taken into account. Instead of focusing only on DEGs, gene co-expression network (GCN) analysis provides new insight into the identification of co-expressed gene modules, their correlation with specific traits, and the pinpointing of key hub genes 11,12 , which cannot be detected by standard transcriptome www.nature.com/scientificdata www.nature.com/scientificdata/ analysis. This powerful approach has been widely applied to detect diverse stresses response in Nibea albiflora 13 , Oysters 14 , Scophthalmus maximus 6 , etc. In this study, we reported an improved high-quality chromosome-scale genome assembly of turbot combing PacBio single molecule sequencing technique (SMRT) and Illumina short-read sequencing technologies. Based on this improved genome assembly, we re-annotated the protein-coding genes, repetitive sequences and ncRNAs. In addition, we re-analyzed multiple stress-related RNA-seq datasets from National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database by gene co-expression network analysis, and identified multiple gene modules and candidate genes response to various stresses in turbot. Taken together, these resources will not only serve as key resources for studying genomics and further research into the stress response mechanisms, but will also promote the progress of genetic improvement and comprehensive stress-resistant molecular breeding of turbot.

Methods
Turbot samples and genome sequencing. Genomic DNA was extracted from the muscle samples of a super-female (WW) turbot using Puregene Tissue Core Kit A (Qiagen, USA) according to the manufacturer's instruction. The quality of the extracted genomic DNA was checked using electrophoresis on 1% agarose gel and the concentration was quantified using a NanoDrop 2000 to ensure the DNA samples met libraries sequencing requirements.
The extracted DNA molecules were firstly used to construct an Illumina pair-end (PE) library with 350 bp insert size using standard protocols provided by Illumina (San Diego, CA, USA). The PE library was then sequenced using the Illumina HiSeq 4000 platform with 150 bp PE mode according to the manufacturer's instructions. Finally, a total of 51.80 Gb raw reads, accounting ~90X coverage of whole genome, were generated (Table 1).
We also constructed a 20 kb PacBio library following the PacBio manufacturing protocols (Pacific Biosciences, CA, USA) and sequenced it using the PacBio Sequel II platform with the continuous long-read (CLR) mode following the manufacturer's instruction. In total, we obtained 150.30 Gb (~265X) PacBio long reads ( Table 1). The average and N50 lengths of the subreads were 14.13 kb and 25.47 kb, respectively. Genome assembly. Long reads generated from the PacBio Sequel II platform were firstly processed by a self-correction of errors using Canu 15 with default parameters. And then corrected reads were subsequently assembled by Flye (v2.7) 16 (--pacbio-corr --threads 80 --genome-size 568 m). To obtain the final assembly, the draft assembly was removed haplotypic duplication by purge_dups 17 and polished by gcpp (https://github.com/ PacificBiosciences/gcpp) with default parameters using PacBio data, then Pilon 18 (--fix bases) was used to further polish the genome using Illumina data (Fig. 1a). Finally, we obtained a new assembled genome of turbot containing 27 contigs with a total length of 538.22 Mb and a contig N50 length of 25.76 Mb, exhibiting higher contiguity and completeness comparable to other published turbot genomes [19][20][21] (Table 2). In addition, GC content of the genome assembly was estimated to be 43.53%.  Genome annotation. We detected and classified repetitive sequences in the final turbot genome assembly by a combination of homology-based and de novo prediction strategies. In homology-based searching, known repeats were identified using RepeatMasker (V4.1.1) 22 based on the RepBase TE library (version 10/26/2018) 23 . In addition, de novo prediction was conducted using RepeatMasker to further detect novel repeats, which based on the de novo repeats library of the turbot genome constructed with RepeatModeler (http://www.repeatmasker. org/RepeatModeler/) and LTR-FINDER 24 . Finally, a total of 104.45 Mb of non-redundant repetitive sequences (Combined TEs) were obtained, accounting for 19.41% of the assembled genome (Table 3). Amid predominant repeats, DNA transposons were the most abundant (54.16 Mb), representing 10.06% of the genome, followed by long interspersed elements (LINEs, 3.10%), long terminal repeats (LTRs, 2.95%) and short interspersed nuclear elements (SINEs, 0.51%) ( Table 3). Protein-coding gene annotations were then conducted with MAKER (v3.01.03) 25 by a combined strategy of homology-based, de novo, and transcriptome-assisted predictions. For homology-based prediction, protein sequences of seven teleost species, Anabas testudineus, Cynoglossus semilaevis, Danio rerio, Gasterosteus aculeatus, Oryzias latipes, Scophthalmus maximux, Takifugu rubripes, were downloaded from Ensembl and NCBI, and mappped to turbot genome using TBLASTN 26 (e-value ≤ 1e-5). Exonerate (v2.4.0) 27 was used to align homologous protein sequences to turbot genome. Homologous genes were predicted ranging from 35,093 to 48,770 in above species reference sequences (Table 4). For de novo prediction, Augustus 28 and Genscan 29 were employed to analyze the repeats masked genome, which detected 30,320 and 40,007 genes, respectively (Table 4). For transcriptome-assisted prediction, RNA-seq data (NCBI accession number: SRP261889, SRP273870) were aligned to turbot genome to identify potential gene structures, and 16,356 genes were supported. Finally, we performed MAKER (v3.01.03) to integrate genes generated by above predictions to produce a consensus protein-coding gene set consisting of 22,442 genes with an average gene length of 15,828 bp ( Table 4). Comparisons of gene features between turbot and other seven species indicated similar distribution patterns in average length of gene, coding sequence (CDS), exon and intron (Fig. 2).
To obtain functional annotation of the predicted protein-coding genes in turbot genome, InterPro 30 , Pfam 31 , Swissprot 32 and TrEMBL 32 databases were respectively used to predict protein function based on the conserved protein domains by InterProScan (v5.46) 33 . BLASTP (e-value ≤ 1e-5) was used for the homolog search in multiple databases, such as Gene Ontology (GO) 34 , Kyoto Encyclopedia of Genes and Genomes (KEGG) 35 , and NCBI non-redundant protein (NR) 36 databases. Ultimately, a total of 21,360 genes (95.18% of all predicted genes) could be functionally annotated by at least one of the abovementioned databases (Table 5).
For non-coding genes, a total of 1,796 tRNAs were identified using tRNAscan-SE 37 . Moreover, 538 rRNAs were detected through searching for homology against rRNA sequences of related species using BLASTN. Besides, 430 miRNAs and 581 snRNAs were predicted using INFERNAL 38    www.nature.com/scientificdata www.nature.com/scientificdata/ transcriptome profiling in turbot under different stresses (i.e., crowding, feeding, heat, oxygen, pathogens, and salinity) were downloaded from the NCBI SRA database using SRAtoolkit (v2.11.0) 39 . Following, RNA-seq data in SRA format were converted into FASTQ format using fastq-dump tool of SRAtoolkit. Then, reads were aligned to the latest assembled turbot genome using STAR 40 with default parameters. TPMCalculator (-q 1) 41 was used to calculate transcripts per million (TPM) values for all genes using sorted bam files obtained from reads alignment. Subsequently, we used BioNERO 42 to preprocess the gene expression data according to the following steps: I) Replacing missing values (NAs) with 0 using replace_na function; II) Removing the genes whose average gene expression was less than 1 with remove_nonexp function; III) Removing outlying samples with ZKfiltering function; IV) Adjusting for confounding artifacts with PC_correction function to make every gene follow an approximate normal distribution. After filtering and processing (Fig. 1b), a normalized gene expression matrix consisting of 12,271 genes with medial expression value ≥ 1 from 160 RNA-seq samples were obtained.
After we filtered and normalized the expression data, BioNERO 42 was used to construct a gene co-expression network (GCN) (Fig. 1b). First of all, we identified the most optimal β power to make the network satisfy the scale-free topology with the function SFT_fit. According to the result, the optimal power is 11, for which the scale-free topology fit index (R 2 ) reaches 0.8 and mean connectivity tends to 0. Next, we used the exp2gcn  Table 4. General statistics of predicted protein-coding genes in S. maximus genome. www.nature.com/scientificdata www.nature.com/scientificdata/ function to infer the GCN with power 11. As a result, a total of 24 co-expression modules were eventually identified (Fig. 3a), with the number of genes per module ranging from 34 (magenta) to 4,396 (midnightblue).
We then identified modules that were extremely significant (p-value < 0.01) positively or negatively correlated with particular traits (stresses) by calculating module-trait spearman correlation coefficients using the   Table 6. General statistics of non-coding annotation of S. maximus.  Table 7. Overview of the RNA-seq datasets used in this study.

Platform (Illumina) Size (GB) References
www.nature.com/scientificdata www.nature.com/scientificdata/ module_trait_cor function from BioNERO. As shown in Fig. 3b, significantly related modules could be found for each trait, which provides rich resources for the study of turbot stress resistance mechanism. To detect the functionality of the modules that extremely significant correlated with each stress, for each module, GO and KEGG enrichment analyses were performed on all genes in the module using TBtools 43 (corrected p-value (BH method) < 0.5).
Identification of key hub genes. To identify candidate key hub genes related to every stress, we firstly constructed hub genes set. Hub genes, defined as the top 10% genes with highest degree (i.e., sum of connection weights of a gene to all other genes in the module) that have module membership (MM) (i.e., correlation of a gene to its module eigengene) > 0.8, were identified using the function get_hubs_gcn. Then, hub genes belonging to modules that were extremely significant associated with same stress were merged as hub genes set for this stress. Following, we set up the differentially expressed genes (DEGs) set. Firstly, we used featureCounts 44 software program in Subread 45 package to construct reads count matrixes. Then, edgeR 46 was used to identify DEGs with false discovery rate (FDR) < 0.05 and |log 2 FC| > 1. DEGs, related to the same stress, were merged as DEGs set for this stress. Finally, genes, included in both hub genes set and DEGs set corresponding to the same stress, were defined www.nature.com/scientificdata www.nature.com/scientificdata/ as the candidate key hub genes for this stress. Candidate key hub genes related to crowding, feeding, heat, oxygen, pathogens and salinity stress were 0, 128, 40, 7, 90 and 80, respectively.

Heat-related modules enrichment analysis and identification of key hub genes. To heat stress,
GO enrichment analyses illustrated that metabolic process, cellular process, catabolic process, catalytic activity, hydrolase activity, oxidoreductase activity, cellular response to stress, biosynthetic process, and binding were the significantly enriched terms (GO enrichment.xlsx 47 ) in modules that were extremely significant correlated with heat stress. Meanwhile, KEGG enrichment analyses were employed on the same modules, and the results manifested that metabolism, lipid metabolism, carbohydrate metabolism, glycolysis/gluconeogenesis, PPAR signaling pathway, proteasome, digestive system, fat digestion and absorption, peroxisome, cell growth and death, transport and catabolism, cellular processes, and protein kinases were the significantly enriched pathways (KEGG enrichment.xlsx 47 ). Finally, we identified 40 candidate heat-related key hub genes, of which 7 genes including ABI1 48 , CD44 49 , CCDC153 50 , G2e3 51 , PATJ 52 , HYKK 53 and occludin 54 has been verified to contribute to heat stress. For instance, G2e3 was one of the candidate genes in the liver of heat-treated large yellow croaker 51 . Exposure to heat stress (39 °C or 41 °C) resulted in increased expression of occludin protein in Caco-2 cells 54 .
oxygen-related modules enrichment analysis and identification of key hub genes. To oxygen stress, GO enrichment results showed that aerobic respiration, aerobic electron transport chain, metabolic process, oxidoreductase activity, mitochondrial inner membrane, mitochondrial respirasome, respiratory chain complex, oxidative phosphorylation, oxidoreductase complex, catabolic process, catalytic activity, response to stress, response to external stimulus were the significantly enriched terms (GO enrichment.xlsx 47 ) in modules that were extremely significant correlated with oxygen stress. Furthermore, we employed KEGG enrichment analyses on the same modules, and the results demonstrated that metabolism, oxidative phosphorylation, environmental adaptation, energy metabolism, and peroxisome were significantly enriched pathways (KEGG enrichment.xlsx 47 ). In addition, we obtained 7 candidate oxygen-related hub genes, among which AMBP 55 and CNN1 56 had been confirmed to conduce to heat stress. Such as, the gene for A1M is denoted AMBP, which has a physiological role as a protective antioxidant 55 . Five percent oxygen concentration significantly increased the expression levels of CNN1 in adipose-derived stem cell cultures after 2 weeks of induction 56 .

Pathogens-related modules enrichment analysis and identification of key hub genes.
To pathogens stress, GO enrichment results showed that immune response, immune system process, response to wound healing, blood coagulation, hemostasis, response to external stimulus, response to stress, biosynthetic process, catalytic activity, metabolic process, and cellular process were the significantly enriched terms (GO enrichment. xlsx 47 ) in modules that were extremely significant related with pathogens stress. Meanwhile, KEGG enrichment analyses were employed on the same modules, and the results manifested that immune system, human diseases, complement and coagulation cascades, CD molecules, lysosome, phagosome, B cell receptor signaling pathway, hematopoietic cell lineage metabolism, glycosaminoglycan binding proteins, exosome, neutrophil extracellular trap formation, were the significantly enriched pathways (KEGG enrichment.xlsx 47 ). Ultimately, we determined 90 candidate pathogens-relevant hub genes, thereinto, 18 genes, such as CMKLR1 57 73 , and lysozyme 74 , had been proved to be conducive to pathogens stress. For example, LILRA6 is essential for macrophage-mediated immune responses and it has the potential to complement the innate and adaptive immune system against pathogens 64 . Siglec-15 probably plays a conserved, regulatory role in the immune system of vertebrates 68 . In addition to its direct antimicrobial role, more recent evidence has shown that lysozyme modulates the host immune response to infection 74 .
Salinity-related modules enrichment analysis and identification of key hub genes. To salinity stress, according to GO enrichment analyses, terms (GO enrichment.xlsx 47 ), such as, ion binding, small molecule binding, anion binding, proteasome complex, proteasome-activating activity, catabolic process, metabolic process, cellular process, cellular response to stress, binding, and ATP binding were significantly enriched in modules that were extremely significant related with salinity stress. Simultaneously, we employed KEGG enrichment analyses on the same modules, and the results indicated that proteasome pathway, protein kinases were the significantly enriched pathways (KEGG enrichment.xlsx 47 ). After taking the intersection of hub genes set and DEGs set, 80 genes were defined as candidate salinity-associated hub genes, and six genes (NDUFV1 75 , EMSY 76 , RBBP6 77 , ATF2 78 , Map3k7 79 and PSMC2 80 ) had been certified to be related to salinity stress. For example, RBBP6 was one of the identified candidate genes for freshwater vs. marine adaptation in threespine stickleback 77 . MAP3K7, also known as TAK1, could be highly activated by osmotic stress 79 .

Data Records
The raw data, including Illumina and PacBio sequencing data of the whole genome, was submitted to the NCBI SRA with accession number SRP352610 81 . The final genome assembly and annotation gff file are available at National Genomics Data Center with accession number GWHBHEA00000000.1 82 . The final genome assembly is also available through NCBI with accession number GCA_022379125.1 83 . The functional annotation of protein-coding genes, gene expression matrix used for gene co-expression network inference, and gene co-expression network analysis results including genes per module, hub genes set, DEGs set, GO and KEGG enrichment, key hub genes, are available at Figshare 47 . www.nature.com/scientificdata www.nature.com/scientificdata/ Technical Validation Quality and completeness assessment of genome assembly. The quality and completeness of the new assembly were evaluated through three independent approaches. Firstly, the base-level accuracy and completeness were estimated using Merqury 84 by comparing k-mers in the assembly to those found in the high-accuracy Illumina reads. The results revealed that per-base accuracy rates for turbot assembly was 0.999994 and completeness value was 99.38%. Secondly, the completeness of the final assembled genome was also assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO v4.1.6) 85 with 4,584 single-copy orthologs from actinopterygii_odb9 database. BUSCO analysis revealed that 97.4% (4,465) complete BUSCOs (94.9% single-copy and 2.5% duplicated BUSCOs) and 1.1% (49) fragmented BUSCOs were identified in the assembled genome of turbot. Thirdly, we further evaluated the assembly quality using Inspector 86 by aligning PacBio long reads to the assembled contigs for generating read-to-contig alignment and performing downstream assembly evaluation. As a result, read-to-contig mapping rate and quality value (QV) were 91.93% and 45.41, respectively. All these indicators suggested a high-quality and complete genome assembly for the further research in genetics and genomics of turbot.

Code availability
The data analysis methods, software and associated parameters used in this study are described in the Methods section. Default parameters were applied if no parameter was described. No custom scripts were generated in this work.